An Exploratory Data and Network Analysis of Movies
Introduction
In this report, we will be analysing a dataset from Kaggle, which contains movies of different genres produced over a vast number of years. What makes this analysis interesting is that we can try and draw various conclusions based on a movie’s popularity, directors or actors involved, year of production, and so forth. Moreover, we can construct various networks in an attempt to find meaningful and interesting results. When inspecting a database of films from recent years, various interesting inferences are uncovered. A film may have a high rating yet low return on investment (ROI). Which genre would you guess is the most successful? Which actors do you think are the most popular?
We have split our Exploratory Data Analysis into four main parts:
| Section | |
|---|---|
| 1 | Introducing the Data - We first try to understand the data and look at its content. |
| 2 | Pre-Processing - We look at what needs to be altered or removed from the dataset. - We try to clean any dirty text. - We try to minimise the dataset’s missing values. |
| 3 | Exploring the Data - We conduct basic analysis on the dataset. - We explore genres. - We explore movie popularity. - We look at the profit, gross, and return on investment of movies. - We conduct more advanced analysis on the dataset. |
| 4 | Network Analysis - We measure the network (centrality, degree distribution, number of components, average degree) - We use network measures to highlight certain nodes (actors) and see which measures of an actor will increase ratings and budgets. |
Admin
Before we start, let’s keep this code chunk for importing the correct libraries and loading the appropriate dataset. We use pacman to load the following:
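The original chunk is not reproduced here. A minimal sketch, assuming the package list can be inferred from the functions called later in this report (`aggr`, `ggarrange`, `geom_text_repel`, `formattable`, and the usual tidyverse verbs); the exact list the authors loaded is not known:

```r
# Packages inferred from the functions used later in this report.
# This list is an assumption, not the authors' original one.
pkgs <- c("ggplot2", "dplyr", "tidyr", "stringr",
          "VIM", "ggrepel", "ggpubr", "formattable")

# p_load() installs any missing package and then attaches it
if (requireNamespace("pacman", quietly = TRUE)) {
  pacman::p_load(char = pkgs)
} else {
  message("pacman is not installed; run install.packages('pacman') first")
}
```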
We import the dataset like this:
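The import chunk is also missing from this copy. A hedged sketch, assuming the Kaggle file was saved locally as `movie_metadata.csv` (a hypothetical path) and read with base R:

```r
# Hypothetical file name: adjust to wherever the Kaggle CSV was saved
csv_path <- "movie_metadata.csv"

if (file.exists(csv_path)) {
  # Keep text columns as character vectors rather than factors
  movie_metadata <- read.csv(csv_path, stringsAsFactors = FALSE)
} else {
  message("File not found: ", csv_path)
}
```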
In the next section we introduce our dataset and look at its content.
Introducing The Dataset
This section of the report is quite essential for our analysis. We cannot make any interesting inferences from the dataset if we do not know what is contained within it. In this section we will try to understand exactly what we are dealing with. Thereafter, we can begin to draw interesting results.
The movie_metadata dataset contains 28 columns/variables, each of which is described in the table below:
| Variable Name | Description |
|---|---|
| color | Specifies whether a movie is in black and white or color |
| director_name | Contains name of the director of a movie |
| num_critic_for_reviews | Contains number of critic reviews per movie |
| duration | Contains duration of a movie in minutes |
| director_facebook_likes | Contains number of facebook likes for a director |
| actor_3_facebook_likes | Contains number of facebook likes for actor 3 |
| actor_2_name | Contains name of 2nd leading actor of a movie |
| actor_1_facebook_likes | Contains number of facebook likes for actor 1 |
| gross | Contains the amount a movie grossed in USD |
| genres | Contains the sub-genres to which a movie belongs |
| actor_1_name | Contains name of the actor in lead role |
| movie_title | Title of the Movie |
| num_voted_users | Contains number of users votes for a movie |
| cast_total_facebook_likes | Contains number of facebook likes for the entire cast of a movie |
| actor_3_name | Contains the name of the 3rd leading actor of a movie |
| facenumber_in_poster | Contains number of actors faces on a movie poster |
| plot_keywords | Contains key plot words associated with a movie |
| movie_imdb_link | Contains the link to the imdb movie page |
| num_user_for_reviews | Contains the number of user generated reviews per movie |
| language | Contains the language of a movie |
| country | Contains the name of the country in which a movie was made |
| content_rating | Contains maturity rating of a movie |
| budget | Contains the amount of money spent in production per movie |
| title_year | Contains the year in which a film was released |
| actor_2_facebook_likes | Contains number of facebook likes for actor 2 |
| imdb_score | Contains user generated rating per movie |
| aspect_ratio | Contains the size of the aspect ratio of a movie |
| movie_facebook_likes | Number of likes of the movie on its Facebook Page |
Furthermore, the dataset contains 5043 movies, spanning 96 years and 46 countries. There are 1693 unique director names and 5390 unique actors/actresses. Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries.
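Summary figures like these can be derived with a few base R calls. A minimal sketch on a made-up toy data frame (the names `movies`, `n_directors`, and `country_share` are illustrative, and the values are not taken from the dataset):

```r
# Toy data standing in for movie_metadata (values are made up)
movies <- data.frame(
  director_name = c("X", "Y", "X", "Z"),
  country       = c("USA", "USA", "UK", "USA"),
  stringsAsFactors = FALSE
)

# Number of distinct directors
n_directors <- length(unique(movies$director_name))

# Percentage of movies per country
country_share <- round(100 * prop.table(table(movies$country)), 1)
```

Replacing `movies` with `movie_metadata` would give the corresponding figures for the full dataset.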
The structure of the dataset can also be used to understand our data. We can run the following code chunk to see its structure.
## 'data.frame': 5043 obs. of 28 variables:
## $ color : chr "Color" "Color" "Color" "Color" ...
## $ director_name : chr "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
## $ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
## $ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
## $ director_facebook_likes : int 0 563 0 22000 131 475 0 15 0 282 ...
## $ actor_3_facebook_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
## $ actor_2_name : chr "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
## $ actor_1_facebook_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
## $ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
## $ genres : chr "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
## $ actor_1_name : chr "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
## $ movie_title : chr "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
## $ num_voted_users : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
## $ cast_total_facebook_likes: int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
## $ actor_3_name : chr "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
## $ facenumber_in_poster : int 0 0 1 0 0 1 0 1 4 3 ...
## $ plot_keywords : chr "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
## $ movie_imdb_link : chr "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
## $ num_user_for_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
## $ language : chr "English" "English" "English" "English" ...
## $ country : chr "USA" "USA" "UK" "USA" ...
## $ content_rating : chr "PG-13" "PG-13" "PG-13" "PG-13" ...
## $ budget : num 237000000 300000000 245000000 250000000 NA ...
## $ title_year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
## $ actor_2_facebook_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
## $ imdb_score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
## $ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
## $ movie_facebook_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
In the next section we can start preparing the dataset for analysis by removing and simplifying some of the data.
Pre-Processing Data
In this part of the report we attempt to look for various things that may have a negative or significant impact on the inferences we make on the dataset. Once we have sufficiently cleaned and prepared the dataset, we can commence with drawing various conclusions from the graphs we generate.
Duplicate Rows
In the movie_metadata dataset, there are 45 duplicated rows, which need to be removed so that only the unique rows are kept.
## [1] 45
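The chunk that produced this count is not shown. One way to count and drop duplicated rows, sketched on a toy data frame (the `movies` object and its values are made up for illustration):

```r
# Toy data standing in for movie_metadata (values are made up)
movies <- data.frame(
  movie_title = c("A", "B", "A", "C"),
  title_year  = c(2001, 2005, 2001, 2010),
  stringsAsFactors = FALSE
)

# duplicated() flags every row that repeats an earlier row
n_dup <- sum(duplicated(movies))

# Keep only the first occurrence of each row
movies_unique <- movies[!duplicated(movies), ]
```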
Missing Values
Let’s have a look at the number of NA values in our dataset:
## color director_name
## 0 0
## num_critic_for_reviews duration
## 49 15
## director_facebook_likes actor_3_facebook_likes
## 103 23
## actor_2_name actor_1_facebook_likes
## 0 7
## gross genres
## 874 0
## actor_1_name movie_title
## 0 0
## num_voted_users cast_total_facebook_likes
## 0 0
## actor_3_name facenumber_in_poster
## 0 13
## plot_keywords movie_imdb_link
## 0 0
## num_user_for_reviews language
## 21 0
## country content_rating
## 0 0
## budget title_year
## 487 107
## actor_2_facebook_likes imdb_score
## 13 0
## aspect_ratio movie_facebook_likes
## 327 0
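A per-column count like the one above is typically obtained with `colSums(is.na(...))`. A minimal sketch on a toy data frame (made-up values; the report's own chunk is not shown):

```r
# Toy data standing in for movie_metadata (values are made up)
movies <- data.frame(
  gross      = c(100, NA, 300, NA),
  budget     = c(50, 60, NA, 80),
  imdb_score = c(7.1, 6.5, 8.0, 5.9)
)

# Number of missing values per column
na_counts <- colSums(is.na(movies))

# Share of missing values per column
na_share <- colMeans(is.na(movies))
```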
To help visualise this, have a look at the following heatmap of the missing values:
# Visualise Missing Values
missing.values <- aggr(movie_metadata, sortVars = TRUE, prop = TRUE, sortCombs = TRUE, cex.lab = 1.5, cex.axis = .6, cex.numbers = 5, combined = FALSE, gap = -.2)
##
## Variables sorted by number of missings:
## Variable Count
## gross 0.174869948
## budget 0.097438976
## aspect_ratio 0.065426170
## title_year 0.021408563
## director_facebook_likes 0.020608243
## num_critic_for_reviews 0.009803922
## actor_3_facebook_likes 0.004601841
## num_user_for_reviews 0.004201681
## duration 0.003001200
## facenumber_in_poster 0.002601040
## actor_2_facebook_likes 0.002601040
## actor_1_facebook_likes 0.001400560
## color 0.000000000
## director_name 0.000000000
## actor_2_name 0.000000000
## genres 0.000000000
## actor_1_name 0.000000000
## movie_title 0.000000000
## num_voted_users 0.000000000
## cast_total_facebook_likes 0.000000000
## actor_3_name 0.000000000
## plot_keywords 0.000000000
## movie_imdb_link 0.000000000
## language 0.000000000
## country 0.000000000
## content_rating 0.000000000
## imdb_score 0.000000000
## movie_facebook_likes 0.000000000
Gross and Budget
Since gross and budget have many missing values (874 and 487 respectively), and we want to keep these two variables for the following analysis, we delete the rows with missing values for gross and budget rather than impute them, because imputation would not do a good job here.
# Find NA values for gross and budget
movie_metadata <- movie_metadata[!is.na(movie_metadata$gross), ]
movie_metadata <- movie_metadata[!is.na(movie_metadata$budget), ]
dim(movie_metadata)
## [1] 3857 28
The number of observations has decreased by 4998 - 3857 = 1141, which is 22.8% of the previous total.
Content Rating
The dataset contains a vast range of content rating, which can be seen below:
##
## Approved G GP M NC-17 Not Rated
## 51 17 91 1 2 6 42
## Passed PG PG-13 R Unrated X
## 3 573 1314 1723 24 10
We find that M = GP = PG, X = NC-17, so let’s replace M and GP with PG, and X with NC-17, because these are (apparently) what we use today.
# Renaming content ratings
movie_metadata$content_rating[movie_metadata$content_rating == 'M'] <- 'PG'
movie_metadata$content_rating[movie_metadata$content_rating == 'GP'] <- 'PG'
movie_metadata$content_rating[movie_metadata$content_rating == 'X'] <- 'NC-17' # No one under 17
We want to replace Approved, Not Rated, Passed, and Unrated with the most common rating, R.
movie_metadata$content_rating[movie_metadata$content_rating == 'Approved'] <- 'R'
movie_metadata$content_rating[movie_metadata$content_rating == 'Not Rated'] <- 'R'
movie_metadata$content_rating[movie_metadata$content_rating == 'Passed'] <- 'R'
movie_metadata$content_rating[movie_metadata$content_rating == 'Unrated'] <- 'R'
movie_metadata$content_rating <- factor(movie_metadata$content_rating)
table(movie_metadata$content_rating)
##
## G NC-17 PG PG-13 R
## 51 91 16 576 1314 1809
Blank values should be treated as missing values. Since these missing values cannot be replaced with reasonable data, we delete these rows.
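The code for this step is not shown. A hedged sketch of one way to do it, on a toy data frame with made-up values: recode empty strings as `NA`, then drop those rows.

```r
# Toy data standing in for movie_metadata (values are made up)
movies <- data.frame(
  movie_title    = c("A", "B", "C"),
  content_rating = c("PG", "", "R"),
  stringsAsFactors = FALSE
)

# Treat blank ratings as missing
movies$content_rating[movies$content_rating == ""] <- NA

# Drop the rows whose rating is missing
movies <- movies[!is.na(movies$content_rating), ]
```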
Delete (Some) Rows
Let’s now have a look at how many complete cases we have.
## color director_name
## 0 0
## num_critic_for_reviews duration
## 1 0
## director_facebook_likes actor_3_facebook_likes
## 0 6
## actor_2_name actor_1_facebook_likes
## 0 1
## gross genres
## 0 0
## actor_1_name movie_title
## 0 0
## num_voted_users cast_total_facebook_likes
## 0 0
## actor_3_name facenumber_in_poster
## 0 6
## plot_keywords movie_imdb_link
## 0 0
## num_user_for_reviews language
## 0 0
## country content_rating
## 0 0
## budget title_year
## 0 0
## actor_2_facebook_likes imdb_score
## 2 0
## aspect_ratio movie_facebook_likes
## 55 0
We remove aspect_ratio because 1. it has a lot of missing values and 2. we will not be looking into the impact that it has on other data (we assume that it doesn’t).
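Dropping a column can be done with `subset()`, the same idiom this report uses later to remove the genres column. A minimal sketch on toy data (values are made up):

```r
# Toy data standing in for movie_metadata (values are made up)
movies <- data.frame(
  imdb_score   = c(7.1, 6.5, 8.0),
  aspect_ratio = c(1.85, NA, 2.35)
)

# Remove the aspect_ratio column
movies <- subset(movies, select = -c(aspect_ratio))
```

The same pattern would apply to any other column that is dropped because it is nearly constant or mostly missing.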
Add a Column
Gross and Budget
We have gross and budget information. So let’s add two columns: profit and percentage return on investment for further analysis.
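The chunk that adds these columns is not shown. A minimal sketch, assuming the column names `profit` and `return_on_investment_perc` that appear in later chunks, and using a toy data frame with made-up values:

```r
# Toy data standing in for movie_metadata (values are made up)
movies <- data.frame(
  movie_title = c("A", "B"),
  gross       = c(300, 50),
  budget      = c(100, 100)
)

# Profit is gross minus budget
movies$profit <- movies$gross - movies$budget

# Return on investment as a percentage of the budget
movies$return_on_investment_perc <- (movies$profit / movies$budget) * 100
```

A movie that grosses three times its budget therefore has an ROI of 200%, while one that grosses half its budget has an ROI of -50%.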
Remove (Some) Columns
Colour
Next, we take a look at the influence of colour vs black and white.
##
## Black and White Color
## 2 124 3680
Since only about 3.4% of the movies are in black and white, the color column is nearly constant, so we can remove it.
Language
Let’s have a look at the different languages contained within the dataset.
##
## Aboriginal Arabic Aramaic Bosnian Cantonese
## 2 2 1 1 1 7
## Czech Danish Dari Dutch English Filipino
## 1 3 2 3 3644 1
## French German Hebrew Hindi Hungarian Indonesian
## 34 11 2 5 1 2
## Italian Japanese Kazakh Korean Mandarin Maya
## 7 10 1 5 14 1
## Mongolian None Norwegian Persian Portuguese Romanian
## 1 1 4 3 5 1
## Russian Spanish Thai Vietnamese Zulu
## 1 24 3 1 1
Almost 95% of the movies are in English, which means this variable is nearly constant. Let’s remove it.
Country
Next, we can look at the different countries.
##
## Afghanistan Argentina Aruba Australia Belgium
## 1 3 1 40 1
## Brazil Canada Chile China Colombia
## 5 63 1 13 1
## Czech Republic Denmark Finland France Georgia
## 3 9 1 103 1
## Germany Greece Hong Kong Hungary Iceland
## 79 1 13 2 1
## India Indonesia Iran Ireland Israel
## 5 1 4 7 2
## Italy Japan Mexico Netherlands New Line
## 11 15 10 3 1
## New Zealand Norway Official site Peru Philippines
## 11 4 1 1 1
## Poland Romania Russia South Africa South Korea
## 1 2 3 3 8
## Spain Taiwan Thailand UK USA
## 22 2 4 316 3025
## West Germany
## 1
Around 79% of the movies are from the USA, 8% from the UK, and 13% from other countries. So we group the other countries together to give this categorical variable fewer levels: USA, UK, Others.
# Grouping countries
levels(movie_metadata$country) <- c(levels(movie_metadata$country), "Others")
movie_metadata$country[(movie_metadata$country != 'USA')&(movie_metadata$country != 'UK')] <- 'Others'
movie_metadata$country <- factor(movie_metadata$country)
table(movie_metadata$country)
##
## Others UK USA
## 465 316 3025
Now that we’ve cleaned up our dataset, we can explore our data even further! In the next section we will be looking at genres, movie popularity, gross, profit, and many other aspects pertinent to our data.
Analysing Data
When inspecting a dataset of movies over the past few years, various interesting inferences can be uncovered. A movie may have a high rating yet low return on investment. Which genre is the most successful? Which actors are the most popular? These are some of the questions we aim to answer in this section.
We can start by performing basic analysis on our data. Thereafter, we delve a bit deeper into more specific parts of the dataset, in hopes of uncovering interesting observations.
Basic Analysis
Let’s first have a look at the number of movies that are produced over the years.
# Plotting the number of movies released
ggplot(movie_metadata, aes(title_year)) +
geom_bar() +
labs(x = "Year movie was released", y = "Movie Count", title = "Number of Movies Released Per Year (1916 - 2016)") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept=c(1980), linetype="dotted") +
ggplot2::annotate("text", label = "Year = 1980",x = 1979, y = 50, size = 3, colour = "blue", angle=90)
From the graph, we see there aren’t many records of movies released before 1980. It’s better to remove those records because they might not be representative of the data.
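The filtering step itself is not shown. A minimal sketch on toy data with made-up values, assuming that movies released in 1980 are kept:

```r
# Toy data standing in for movie_metadata (values are made up)
movies <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  title_year  = c(1965, 1980, 2010, NA)
)

# Keep movies released in 1980 or later; which() also drops rows
# whose release year is missing instead of returning NA rows
movies <- movies[which(movies$title_year >= 1980), ]
```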
Let’s have a look at the movie counts now:
# Plotting the number of movies released since 1980
ggplot(movie_metadata, aes(title_year)) +
geom_bar() +
labs(x = "Year movie was released", y = "Movie Count", title = "Number of Movies Released Per Year (1980 - 2016)") +
theme(plot.title = element_text(hjust = 0.5))
The graph above illustrates the number of movies released for the period 1980 - 2016. As we can see, the number of movies released rose rapidly from the 1980s onwards.
Movie Genre Analysis
Now we can delve into more specific things regarding movies, like genres.
Top Genres
genre = movie_metadata['genres']
# Make genre a dataframe
genre = data.frame(table(genre))
# Order genres based on frequency
genre = genre[order(genre$Freq,decreasing=TRUE),]
# Top 20 genres with the most movies
ggplot(genre[1:20,], aes(x=reorder(factor(genre), Freq), y=Freq, alpha=Freq)) +
geom_bar(stat = "identity", fill="maroon") +
geom_text(aes(label=Freq),hjust=1.2, size=3.5)+
xlab("Genre") +
ylab("Number of Movies") +
ggtitle("Top 20 Genres with the most movies") +
coord_flip()
From the above, combinations of Comedy, Romance, and Drama appear to be, by far, the most frequently produced genres. As you can see, movies are associated with multiple genres. For analysis purposes, we choose to use the first word in the genre column, as this is likely to be the most accurate description of the movie.
Split Genres
Here we first split the genres into multiple columns and merge them together.
## [1] "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy"
## [3] "Action|Adventure|Thriller" "Action|Thriller"
## [5] "Action|Adventure|Sci-Fi" "Action|Adventure|Romance"
Let’s split the genres separated by “|” into 8 different columns.
# Split each genre string on "|"
genres_split <- strsplit(movie_metadata$genres, "[|]")
# Pad every movie to the same number of genres with NA;
# rbind() on vectors of unequal length would otherwise recycle the shorter ones
max_genres <- max(lengths(genres_split))
genres_matrix <- do.call(rbind, lapply(genres_split, `length<-`, max_genres))
# Dataframe of genres
genres_df <- as.data.frame(genres_matrix, stringsAsFactors = FALSE)
genres_df consists of 8 columns, each with different genres. Let’s have a look at the frequency of ALL the genres.
# Collapse all genres into one column and drop the NA padding
genres_one_col <- gather(genres_df) %>%
select(value) %>%
filter(!is.na(value))
# Genres that occur at least 30 times
top30 <- genres_one_col %>%
group_by(value) %>%
tally() %>%
filter(n >= 30)
# Plot frequency of all genres
top30 %>%
ggplot(aes(x=reorder(factor(value), n), y=n, alpha=n)) +
geom_bar(stat="identity", fill="maroon") +
geom_text(aes(label=n),hjust=1.2, size=3.5) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Genre") +
ylab("Frequency") +
ggtitle("Movie Genre Frequency") +
coord_flip()
It is evident that Drama and Comedy are still the most frequently produced genres. It is also interesting to note that Romance is considerably lower on the list than in the previous graph. This may imply that Romance predominantly co-occurs with Comedy or Drama and does not co-occur with the other genres as frequently as Comedy and Drama do. Additionally, the fact that Comedy and Drama occur the most does not necessarily mean that they are the most profitable or return the highest ROI. We will try to explore this later on in the report.
Previously we assumed that the first genre is the most applicable, therefore, we choose the first column as the genre for the movie and append it to the dataframe.
# Remove old genre column
movie_metadata <- subset(movie_metadata, select = -c(genres))
# Take first column of genres_df and add it to MAIN df
movie_metadata$genre <- genres_df$V1
What does this distribution look like over the years? Let’s have a look at the frequency of genres between 1980 and 2016.
# Plotting the movie genres produced
movie_metadata %>%
group_by(title_year, genre) %>%
summarise(count = n()) %>%
ggplot(aes(title_year, as.factor(genre))) +
geom_tile(aes(fill=count),colour="white") +
scale_fill_gradient(low="light blue",high = "dark blue") +
xlab("Year of Movie") +
ylab("Genre of Movie") +
ggtitle("Heat Map of Movie Genres Produced Over the Years") +
theme(panel.background = element_blank())
We can make one or two remarks from this heatmap. Firstly, we can see that Action, Adventure, Comedy, and Drama are the genres most often used as the first term associated with a movie. Secondly, we can see that Romance, which previously had a high frequency when co-occurring with Comedy and Drama, is now very low. This means that very few movies are predominantly Romance; Romance is mostly the second, third, or later term used to describe a movie.
Additionally, we can also see that some genres, like Action and Comedy have picked up over the years. This is evident by looking at the darker shades of blue becoming more prominent in the latter years.
Which Genres are Popular?
In our dataset we have Facebook Likes and IMDB scores associated with a movie. This can help give us an indication of how popular each genre is. There are different types of Facebook likes within the dataset, namely director_facebook_likes, actor_3_facebook_likes, actor_1_facebook_likes, cast_total_facebook_likes, actor_2_facebook_likes, and finally movie_facebook_likes. Let’s add them all together for each movie.
# Add column for total facebook likes
movie_metadata$total_facebook_likes <- movie_metadata$director_facebook_likes +
movie_metadata$actor_3_facebook_likes + movie_metadata$actor_1_facebook_likes +
movie_metadata$cast_total_facebook_likes + movie_metadata$actor_2_facebook_likes +
movie_metadata$movie_facebook_likes
Now let’s calculate the average IMDB score, average user votes, average Facebook likes and average number of reviews for each genre.
# creating a data frame containing avg score, avg votes and avg fb likes
score_votes_likes <- movie_metadata %>%
group_by(genre) %>%
summarise(count = n(),
avg_score = round(mean(imdb_score), 1),
avg_votes = mean(num_voted_users),
avg_facebook_likes = mean(total_facebook_likes),
avg_reviews = mean(num_user_for_reviews)) %>%
filter(count > 10)
# arranging data frame by average Facebook likes
arr_score <- arrange(score_votes_likes, desc(avg_facebook_likes))
# Show data frame
as.data.frame(score_votes_likes)
## genre count avg_score avg_votes avg_facebook_likes avg_reviews
## 1 Action 937 6.3 145049.12 38762.692 462.1942
## 2 Adventure 356 6.5 133549.07 40494.340 355.0365
## 3 Animation 43 6.7 83318.51 30471.674 202.0698
## 4 Biography 200 7.1 100233.72 41000.900 260.4650
## 5 Comedy 997 6.1 64763.60 23896.765 197.9388
## 6 Crime 247 6.9 130843.95 33857.397 362.5789
## 7 Documentary 32 6.8 16159.19 5623.312 126.3750
## 8 Drama 667 6.8 85462.68 31889.940 319.7301
## 9 Fantasy 37 6.3 87311.14 25452.270 374.7027
## 10 Horror 152 5.7 66091.12 21356.158 406.8750
## 11 Mystery 24 6.6 180824.04 35571.625 570.2500
The most liked genres on Facebook are Biography (41000.900), Adventure (40494.340), and Action (38762.692). The most popular genres, based on IMDB scores, are Biography (7.1), Crime (6.9), Documentary (6.8), and Drama (6.8). It may be interesting to note that previously we saw that Comedy and Drama are produced the most frequently overall, yet the public scores Biography, Crime, Documentary, and Drama movies consistently high. This may also imply that there could be a vast range in scores for Comedy, explaining its lower average score. Furthermore, Comedy, Action, and Drama contain by far the most movies, which may account for their lower overall averages.
Popularity Analysis
IMDB ratings VS Movie Count
Let’s have a look at the distribution of IMDB ratings across the movies. Below we can see that the data is slightly skewed to the left: there is a concentration of data around 6 out of 10 and a long tail to the left. The vast majority of the movies are given a score between 5 and 7.5, with fewer movies scoring higher than that.
# Plotting the IMDB ratings vs movie count
ggplot(movie_metadata, aes(imdb_score)) +
geom_histogram(bins = 50) +
geom_vline(xintercept = mean(movie_metadata$imdb_score,na.rm = TRUE),colour = "blue") +
ylab("Movie Count") +
xlab("IMDB Rating") +
scale_x_continuous(limits = c(0, 10)) +
ggtitle("IMDB Ratings for Movies") +
ggplot2::annotate("text", label = "Mean IMDB rating",x = 6.2, y = 50, size = 3, colour = "yellow",angle=90)
The lowest-scoring movie, Justin Bieber: Never Say Never, has a score of 1.6, whereas the highest score is 9.3 for The Shawshank Redemption. The mean of the IMDB scores is 6.433288.
Popularity over the years
Let’s take a look at when Facebook started. From the graph it is clear that the number of Facebook likes for movies released post 2004 have increased dramatically. Old movies receive fewer likes, which is most likely due to Facebook marketing newer movies.
#Creating the required subset of data
movies_pop <- movie_metadata %>%
select(title_year, movie_facebook_likes) %>%
filter(title_year > 1980) %>%
group_by(title_year) %>%
summarise(avg = mean(movie_facebook_likes))
#Generating the popularity vs time plot
ggplot(movies_pop, aes(x = title_year, y = avg)) +
geom_point() +
geom_smooth() +
geom_vline(xintercept = c(2004),colour = c("blue")) +
ylab("Average Facebook Likes") +
xlab("Years Movie Was Produced") +
ggplot2::annotate("text", label = "Facebook",x = 2003.5, y = 10000, size = 3, colour = "blue",angle=90)
Facebook Likes VS IMDB Score
# Plotting the Facebook Likes VS IMDB Score
movie_metadata %>%
group_by(content_rating) %>%
summarise(avg_fb_likes = mean(movie_facebook_likes), avg_imdb_score = mean(imdb_score), num=n()) %>%
ggplot() +
geom_point(aes(x=avg_fb_likes, y=avg_imdb_score, color=content_rating, size=num), stat="identity") +
scale_y_continuous(limits = c(0, 10)) +
xlab("Average Facebook Likes") +
ylab("Average imdb score")
There are three things to look at in this graph: the average IMDB score, the average number of Facebook likes, and the number of movies per content rating. We can see that all ratings have (on average) very similar IMDB scores; however, they differ considerably in the number of Facebook likes. For example, movies rated PG-13 receive many more Facebook likes (on average) than movies rated NC-17. Movies rated R receive (on average) more likes than movies rated NC-17, but still have relatively similar IMDB scores.
IMDB Scores VS Facebook Likes
The smoothed curve suggests a positive relationship between a movie’s IMDB score and its Facebook likes. This is expected, as a higher rating reflects higher viewer satisfaction, which in turn should show up as a stronger positive online presence. Initially, this graph was constructed to see whether there would be a difference between viewer enjoyment and movie rating. Movie databases are often criticised for the nature of their rating scales, which are set by critics and place priority on sentiment and plot, and so may not fully coincide with viewer enjoyment. However, as seen below, this is not the case with IMDB’s scoring.
#Plotting Facebook likes against IMDB score
ggplot(data = movie_metadata, aes(x = imdb_score, y = movie_facebook_likes)) +
scale_x_continuous(limits = c(0, 10)) +
geom_smooth()
Vote Counts
The IMDB rating system was first implemented in the 1990s, and social media platforms like Facebook started in the mid 2000s. The plot suggests that vote counts grew rapidly after IMDB was introduced, and that this growth was later amplified by the introduction of Facebook.
#Performing operations on Movies Vote Count over the years
movies_vote1 <- movie_metadata %>%
select(title_year, num_voted_users) %>%
group_by(title_year) %>%
summarise(count = sum(num_voted_users))
ggplot(movies_vote1, aes(x = title_year, y = count/1000)) +
geom_bar( stat = "identity") +
geom_vline(xintercept = c(1990,2004),colour = c("orange","blue")) +
ylab("Vote count (in thousands)") +
xlab("Years") +
scale_x_continuous(limits = c(1980, 2014)) +
ggplot2::annotate("text", label = "Facebook",x = 2003, y = 21000, size = 3, colour = "blue",angle=90) +
ggplot2::annotate("text", label = "IMDB",x = 1989, y = 21000, size = 3, colour = "orange",angle=90)
Top 20 directors with highest average IMDB score
Let’s take a look at the directors. We can see that the top IMDB rated directors have very similar scores (8.1 - 8.6). Tony Kaye has the highest rating of 8.6.
# Displaying the avg_imdb score of directors
movie_metadata %>%
group_by(director_name) %>%
summarise(avg_imdb = mean(imdb_score)) %>%
arrange(desc(avg_imdb)) %>%
top_n(20, avg_imdb) %>%
formattable(list(avg_imdb = color_bar("orange")), align = 'l')
| director_name | avg_imdb |
|---|---|
| Tony Kaye | 8.600000 |
| Damien Chazelle | 8.500000 |
| Majid Majidi | 8.500000 |
| Ron Fricke | 8.500000 |
| Christopher Nolan | 8.425000 |
| Asghar Farhadi | 8.400000 |
| Marius A. Markevicius | 8.400000 |
| Richard Marquand | 8.400000 |
| Sergio Leone | 8.400000 |
| Lee Unkrich | 8.300000 |
| Lenny Abrahamson | 8.300000 |
| Pete Docter | 8.233333 |
| Hayao Miyazaki | 8.225000 |
| Joshua Oppenheimer | 8.200000 |
| Juan José Campanella | 8.200000 |
| Quentin Tarantino | 8.200000 |
| David Sington | 8.100000 |
| Je-kyu Kang | 8.100000 |
| Terry George | 8.100000 |
| Tim Miller | 8.100000 |
Profit | Gross | Return on Investment
In this part of the report we look at the budgets that were allocated as well as the gross that was achieved after each movie was released. We can observe that even though The Host had the highest budget, it does not appear among the top 15 highest-grossing movies. In fact, the highest-grossing movie is Avatar, which isn’t even in the top 15 for highest allocated budget.
# Plotting movie budget and gross
budget <- movie_metadata %>%
select(movie_title, budget) %>%
arrange(desc(budget)) %>%
head(15)
x <- ggplot(budget, aes(x = reorder(movie_title, -desc(budget)), y = budget/1000000)) +
geom_bar( stat = "identity")+
theme(axis.text.x=element_text(hjust=1))+
ggtitle("Top 15 Highest Movie Budgets")+
xlab("")+
ylab("Budget (in Millions)") +
coord_flip()
rev <- movie_metadata %>%
select(movie_title, gross) %>%
arrange(desc(gross)) %>%
head(15)
y <- ggplot(rev, aes(x = (reorder(movie_title, -desc(gross))), y = gross/1000000)) +
geom_bar( stat = "identity")+
theme(axis.text.x=element_text(hjust=1))+
ggtitle("Top 15 Highest Grossing Movie")+
xlab("")+
ylab("Gross (in Millions)") +
coord_flip()
ggarrange(x, y,
labels = c("A", "B"),
ncol = 1, nrow = 2)
In summary, a higher budget does not necessarily translate into a higher gross. We investigate this claim in the following graph.
Top 20 movies based on its Return on Investment
#Top 20 movies based on its Return on Investment
movie_metadata %>%
filter(budget >100000) %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100) %>%
arrange(desc(profit)) %>%
top_n(20, profit) %>%
ggplot(aes(x=budget/1000000, y=return_on_investment_perc)) +
geom_point(size = 2) +
geom_smooth(size = 1) +
geom_text_repel(aes(label = movie_title), size = 3) +
xlab("Budget $million") +
ylab("Percent Return on Investment") +
ggtitle("20 Most Profitable Movies based on its Return on Investment")
Successful directors such as George Lucas also have profitable movies. The plot shows the 20 most profitable movies (with a budget above $100,000) by their percentage return on investment, (profit/budget)*100. Since the absolute profit earned by a movie does not give a clear picture of its monetary success over the years (1980 to 2016), analysing the ROI against the budget provides a better comparison. It is interesting to note that the ROI is high for low-budget films and decreases as the budget increases.
Most Successful Directors Based on Profit
#Top 20 most successful directors
movie_metadata %>%
group_by(director_name) %>%
mutate(profit = gross - budget)%>%
select(director_name, budget, gross, profit) %>%
na.omit() %>%
summarise(films = n(), budget = sum(as.numeric(budget)), gross = sum(as.numeric(gross)), profit = sum(as.numeric(profit))) %>%
mutate(avg_per_film = profit/films) %>%
arrange(desc(avg_per_film)) %>%
top_n(20, avg_per_film) %>%
ggplot( aes(x = films, y = avg_per_film/1000000)) +
geom_point(size = 1, color = "maroon") +
geom_text_repel(aes(label = director_name), size = 3, color = "maroon") +
xlab("Number of Films") + ylab("Avg Profit $millions") +
ggtitle("Most Successful Directors")
These are the top 20 most successful directors based on the average profit earned by the movies they directed. There is an obvious downward trend between average profit and number of films. This could be because a director who makes more movies has more hits and misses, which pulls the average down. It could also be because such filmmakers have a more diverse range of movies, including both smaller and high-budget productions. Looking at the most successful directors, one can see that they either directed only one movie in the dataset (Tim Miller) or created an array of successful films with large budgets (James Cameron).
Top 20 movies based on its Profit
#Top 20 movies based on its Profit
movie_metadata %>%
filter(title_year %in% c(2000:2016)) %>%
mutate(profit = gross - budget,
return_on_investment_perc = (profit/budget)*100) %>%
arrange(desc(profit)) %>%
top_n(20, profit) %>%
ggplot(aes(x=budget/1000000, y=profit/1000000)) +
geom_point(size = 2) +
geom_smooth(size = 1) +
geom_text_repel(aes(label = movie_title), size = 3) +
xlab("Budget $million") +
ylab("Profit $million") +
ggtitle("20 Most Profitable Movies")
These are the top 20 movies released between 2000 and 2016, ranked by profit (gross earnings minus budget). The plot suggests that high-budget movies tend to earn more profit: the trend is almost linear, with profit increasing as budget increases. Avatar has the highest profit, which is consistent with James Cameron appearing among the most successful directors.
Further Analysis
Commercial success Vs Critical acclaim
# Plotting commercial success vs critical acclaim
movie_metadata %>%
top_n(15, profit) %>%
ggplot(aes(x = imdb_score, y = gross/10^6, size = profit/10^6, color = content_rating)) +
geom_point() +
geom_hline(aes(yintercept = 600)) +
geom_vline(aes(xintercept = 7.75)) +
geom_text_repel(aes(label = movie_title), size = 4) +
xlab("Imdb score") +
ylab("Gross money earned in million dollars") +
ggtitle("Commercial success Vs Critical acclaim") +
ggplot2::annotate("text", x = 8.5, y = 700, label = "High ratings \n & High gross") +
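theme(plot.title = element_text(hjust = 0.5))
The graph above compares IMDb score with gross earnings for the 15 most profitable movies; point size shows profit and colour shows content rating. The highest-grossing films can also be compared with the successful directors identified earlier.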
Gross vs Content Rating
# Plotting gross against content rating
movie_metadata %>%
filter(imdb_score >6 ) %>% # Keep only movies with an imdb_score greater than 6
na.omit() %>%
ggplot(aes(x = gross/1000000, y = content_rating), height=0) +
geom_jitter(alpha = 0.5, col = "darkgreen") +
theme(axis.text.x=element_text(angle = 90, hjust = 1))+
ggtitle("Gross vs Content Rating") +
xlab("Gross (millions)") +
ylab("Rating") +
geom_smooth() +
coord_flip()
Correlation Heatmap
# Plotting the heatmap
ggcorr(movie_metadata, label = TRUE, label_round = 2, label_size = 2.8, size = 2, hjust = .85) +
ggtitle("Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
## Warning in ggcorr(movie_metadata, label = TRUE, label_round = 2, label_size
## = 2.8, : data in column(s) 'director_name', 'actor_2_name', 'actor_1_name',
## 'movie_title', 'actor_3_name', 'plot_keywords', 'movie_imdb_link',
## 'language', 'country', 'content_rating', 'genre' are not numeric and were
## ignored
Based on the heatmap, we can see some high correlations (greater than 0.7) between predictors.
The highest correlation value, 0.95, shows that actor_1_facebook_likes is highly correlated with cast_total_facebook_likes, and the likes for actor 2 and actor 3 are also somewhat correlated with the total. We therefore want to reduce these to two variables: actor_1_facebook_likes and other_actors_facebook_likes.
There are high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews.
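One way the two-variable reduction described above could look is sketched below in base R. The like counts are invented, and the column names actor_2_facebook_likes and actor_3_facebook_likes are assumed to follow the same naming pattern as actor_1_facebook_likes; summing the two supporting actors is our assumption about how other_actors_facebook_likes would be built.

```r
# Toy like counts (invented); column names are assumed to match the dataset
toy_likes <- data.frame(
  actor_1_facebook_likes = c(1000, 20000, 500, 12000),
  actor_2_facebook_likes = c(800, 900, 300, 4000),
  actor_3_facebook_likes = c(200, 100, 50, 1000)
)

# Assumption: the combined variable is the sum of the two supporting actors
toy_likes$other_actors_facebook_likes <-
  toy_likes$actor_2_facebook_likes + toy_likes$actor_3_facebook_likes

# Keep only the two variables we want to carry forward
toy_reduced <- toy_likes[, c("actor_1_facebook_likes",
                             "other_actors_facebook_likes")]

# Correlation matrix of the reduced set
cor(toy_reduced)
```

Replacing the three actor columns and the cast total with these two variables would remove the near-duplicate information that the heatmap highlights.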
Sentiment Analysis
Exploration into Movie Plot Keywords
Keywords from movie plot lines are given in a single column, separated by a “|”. The following code will separate these words into separate columns.
#Constructing Top 20
keywords_split <- str_split(IMDB$plot_keywords, pattern="[|]", n=5)
keywords_matrix <- do.call(rbind, strsplit(as.character(IMDB$plot_keywords), "[|]"))
keywords_df <- as.data.frame(keywords_matrix)
names(keywords_df) <- c("one", "two", "three", "four", "five")
keywords_one_col <- gather(keywords_df) %>%
select(value)
keywords_one_col_freq <- keywords_one_col %>%
group_by(value) %>%
tally()
top_20 <- keywords_one_col_freq %>%
select(value, n) %>%
top_n(20)
## Selecting by n
keywords_one_col %>%
group_by(value) %>%
tally() %>%
filter(n > 30) %>%
ggplot() +
geom_bar(aes(x = value, y=n), stat="identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Keyword") +
ylab("Frequency") +
ggtitle("Frequency of common keywords")
Above we can see the most popular keywords describing the films in the dataset.
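The split-and-count step can also be illustrated on a small scale. The sketch below uses base R only and three invented keyword strings; it is a simplified stand-in for the chunk above, not a replacement for it.

```r
# Three invented keyword strings in the same "|"-separated format
toy_keywords <- c(
  "love|friend|murder",
  "murder|police|love",
  "alien|space|love"
)

# Split on the literal "|" and flatten everything into one vector
all_keywords <- unlist(strsplit(toy_keywords, "|", fixed = TRUE))

# Count how often each keyword occurs, most frequent first
keyword_freq <- sort(table(all_keywords), decreasing = TRUE)
head(keyword_freq, 2)
```

Using fixed = TRUE treats "|" as a plain character, which avoids having to escape it as a regular expression.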
The following code checks to see which movies contain common keywords.
#Placing Top 20 Words Against Movie Success
IMDB_true_false <- IMDB
for (keyword in top_20$value) {
IMDB_true_false <- cbind(IMDB_true_false, ifelse(str_detect(IMDB$plot_keywords, keyword), "TRUE", "FALSE"))
}
for (i in 1:20) {
reference <- 27
names(IMDB_true_false)[reference + i] <- top_20$value[i]
}
does_contain_common_key <- data.frame()
for (keyword in top_20$value) {
does_contain_common_key <- rbind(does_contain_common_key, IMDB_true_false %>%
filter(get(keyword) == TRUE) %>% select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords)) %>%
distinct(movie_title, .keep_all = T)
}
does_not_contain_common_key <- data.frame()
for (keyword in top_20$value) {
does_not_contain_common_key <- rbind(does_not_contain_common_key, IMDB_true_false %>%
filter(get(keyword) == FALSE) %>% select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords)) %>%
distinct(movie_title, .keep_all = T)
}
## movie_title
## 1 Coach Carter
## 2 Me, Myself & Irene
## plot_keywords
## 1 basketball|basketball coach|coach|contract|high school
## 2 dissociative identity disorder|limousine|multiple personality|police|rhode island
Here are two movies that contain at least one common plot keyword.
## movie_title
## 1 The Island of Dr. Moreau
## 2 Babel
## plot_keywords
## 1 animal experimentation|chimera|island|jungle|mutant
## 2 american|destiny|mexican border|multiple perspectives|muslim
Here are two movies that do not contain any of the common plot key words.
does_contain_common_key <- does_contain_common_key %>%
mutate(tri = "Top 20 Word")
does_not_contain_common_key <- does_not_contain_common_key %>%
mutate(tri = "NOT Top 20 Word")
all_movies_keywords_indicated <- full_join(does_not_contain_common_key, does_contain_common_key, by = c("movie_title", "gross", "imdb_score", "movie_facebook_likes", "plot_keywords"))
all_movies_keywords_indicated <- all_movies_keywords_indicated %>%
mutate(type = coalesce(tri.y, tri.x)) %>%
select(movie_title, gross, imdb_score, movie_facebook_likes, plot_keywords, type)
all_movies_keywords_indicated <- all_movies_keywords_indicated %>%
group_by(type) %>%
na.omit() %>%
mutate(avg_gross = mean(gross))
summarise(all_movies_keywords_indicated, avg_gross = mean(gross)) %>%
ggplot() +
geom_bar(aes(x = type, y=avg_gross, fill=type), stat="identity", position="stack") +
theme(axis.text.x = element_blank()) +
xlab("Type of movie") +
ylab("Average Gross")
This graph compares the average gross of movies that contain at least one of the top 20 most common plot keywords with that of movies that contain none of them.
The average gross for movies that do not contain any of the top 20 most common keywords appears to be higher.
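For readers who want to see the comparison in compact form, here is a minimal base R sketch on invented data. The vector common_keywords is a hypothetical stand-in for top_20$value, and the sketch matches whole keywords exactly, whereas the code above uses str_detect, which also matches substrings.

```r
# Invented movies; gross values are arbitrary units
toy <- data.frame(
  movie_title = c("A", "B", "C", "D"),
  gross = c(100, 300, 50, 150),
  plot_keywords = c("love|war", "alien|space", "murder|police", "island|jungle"),
  stringsAsFactors = FALSE
)
common_keywords <- c("love", "murder")  # hypothetical stand-in for top_20$value

# TRUE when at least one of a movie's keywords is in the common list
has_common <- vapply(
  strsplit(toy$plot_keywords, "|", fixed = TRUE),
  function(words) any(words %in% common_keywords),
  logical(1)
)
toy$type <- ifelse(has_common, "Top 20 Word", "NOT Top 20 Word")

# Average gross per group
avg_gross_by_type <- aggregate(gross ~ type, data = toy, FUN = mean)
avg_gross_by_type
```

Flagging each movie once in this way avoids building and joining two separate data frames, at the cost of the stricter exact-match rule noted above.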
Sentiment pre-processing
The split keywords are needed in order to perform sentiment analysis.
Calculating Sentiment
Below, the sentiment of the keywords is calculated. The sentiment function used, get_nrc_sentiment, comes from the syuzhet package and can detect the presence of eight different emotions, namely “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, and “trust”. It is also able to calculate positive and negative valence.
#Function for sentiment per year
yearly_sentiment <- function(year, df) {
amount <- nrow(df %>%
select(Year) %>%
filter(Year == year))
df <- df %>%
filter(Year == year)
sentiments <- get_nrc_sentiment(as.character(df[4]))
for (i in 1:length(sentiments)) {
sentiments[i] <- sentiments[i]/amount
}
year_sentiment <- cbind(year, sentiments)
return (year_sentiment)
}
sentiments <- data.frame()
#For-loop to capture all years
for (i in all_years$Year) {
sentiments <- rbind(sentiments,yearly_sentiment(i, keywords_from_split))
}
Sentiment results
As previously mentioned, the number of movies increases dramatically from 1980 onwards, so the calculated sentiments are restricted to 1980 onwards to reduce the sensitivity of the data.
#Sort by year
sentiments_emotions <- sentiments[with(sentiments, order(year)), ] %>%
filter(year >= 1980) %>%
select(-positive, -negative)
#Heatmap for sentiments
rnames <- sentiments_emotions[,1]
mat_sentiments <- data.matrix(sentiments_emotions[,2:ncol(sentiments_emotions)])
rownames(mat_sentiments) <- rnames
mat_sentiments <- t(mat_sentiments)
df_sentiment <- as.data.frame(mat_sentiments)
names_emotions <- c("anger", "anticipation", "disgust","fear","joy","sadness","surprise","trust")
sentiments_graph <- cbind(names_emotions, df_sentiment)
#Heatmap
heatmap.2(mat_sentiments, Rowv=NA, Colv=NA, scale="row", col=colorRampPalette(c("white","darkblue")), margins=c(5,10), trace = "none")
What is immediately visible is that the differences between the types of emotion in plot keywords have reduced over the years.
Below the trend of positive and negative sentiment is explored.
sentiments_filter %>%
ggplot(aes(x = year)) +
geom_line(aes(y = negative, color = 'negative')) +
geom_line(aes(y = positive, color = 'positive')) +
ylab("Sentiment score") +
xlab("Year") +
ggtitle("Positive and negative sentiment of keywords across the years")
Generally speaking, the sentiment of common keywords used has been more positive, but has remained relatively constant, apart from the decrease in both positive and negative sentiment from around 1995 to 2010.
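The yearly sentiment function above divides each year's sentiment counts by the number of movies in that year. As a minimal sketch of that normalisation, assuming per-movie scores were already available, the same per-year average can be computed in base R. The scores below are invented and are not output from syuzhet.

```r
# Invented per-movie sentiment counts (not produced by get_nrc_sentiment)
toy_scores <- data.frame(
  year = c(1980, 1980, 1981),
  positive = c(2, 0, 3),
  negative = c(1, 1, 0)
)

# Mean per year = total count for the year / number of movies in that year
yearly_avg <- aggregate(cbind(positive, negative) ~ year,
                        data = toy_scores, FUN = mean)
yearly_avg
```

Dividing by the number of movies keeps years with many releases from dominating the trend simply because they contribute more keywords.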